This specific hurricane dataset was taken from Kaggle, which in turn was taken from National Hurricane Center (NHC). This dataset includes all tropical cyclones (tropical depression, tropical storm, hurricane, major hurricane) dated all the way back from 1851 to 2015. The most recent 2016 and 2017 Atlantic hurricane seasons weren’t included in the data.
With time series data, there are a lot of questions we could ask. We haven’t done much time series analysis in either EDA or in Stats, so we didn’t have too much technical expertise to do a lot of statistical analysis with the data. We decided on focusing purely on the visualization of the hurricanes, and more specifically, ask the questions:
1) Where do hurricanes form the most throughout the Atlantic ocean? Does this change throughout time?
2) Can we predict where a tropical cyclone could end up based on the origin of the system?
3) Which hurricanes shared the most similar paths? Can we find patterns or trends on those hurricanes?
For some of the questions, we created an R Shiny application which would allow a user to enter their desired system based on year, and find all the relevant visualizations associated with it. For the other questions, we created some very interesting plots that reveal a lot about these magnificent storms.
First, we imported and cleaned the data. This was crucial since this is time series data and there were things we needed to change specifically with the dates that would let R read them properly. We used the ‘lubridate’ package that made this process fairly easy.
hur <- fread(file.path("C:/Users/susmani/Documents/Year 5/Courses/Data Munging/GroupProject2/hurricane.csv"),na.strings = c("PrivacySuppressed", "NULL"))
hur<-data.frame(hur)
hur$Date<-ymd(hur$Date)
hur$Latitude<-as.numeric(unlist(strsplit(hur$Latitude, split='N', fixed=TRUE)))
hur$Longitude<-as.numeric(unlist(strsplit(hur$Longitude, split='W', fixed=TRUE)))
hur$Longitude<-(-1)*hur$Longitude
In the above code, we read in the file, converted it into a data frame that would easily be read. We then used the lubridate package to change the dates and times in this set to match the format that R likes. Lastly, and probably most important for graphing geographical maps of frequency and tracks of hurricanes, we needed to change the Latitude and Longitude points into the format which ggplot2 and ggmap enjoys.
The cleaning, as you can see, is very simple, but it allows us to create some truely mesmerizing visualizations.
The structure of our clearned dataset has over 50,000 single observations of system data throughout time, with over 1,000 unique cases. There are 22 variables in this data set, but we only focus on a few: ID, Year, Date, Time, Latitude, Longitude, Minimum Pressure, Maximum Pressure.
With this knowledge, we can continue to explore - and munge.
First, we wanted to see what some tracks we could look at. To create some gorgeous and interactive plots, we used a combination of mapbox, an online mapping tool, and plotly, an interactive plotting tool in R. To use mapbox with plotly, we need to set a system token that links to Saad’s mapbox account:
Sys.setenv('MAPBOX_TOKEN' = 'pk.eyJ1Ijoic3VzbWFuaSIsImEiOiJjamFhZnJlb3YwczdmMzJxaXlmcHJ0ZGZ2In0.vdCC--CzL7cM-XsG8yCrFw')
Then, we can focus on one specific hurricane, like Hurricane Katrina from 2005, and plot its track:
p <- plot_mapbox(mode = 'scattermapbox') %>%
add_markers(
data = hur[hur$Date >= "2005-01-01" & hur$Name=='KATRINA',], x = ~Longitude, y = ~Latitude, text=~paste('Name: ', Name, '<br>Max Wind:', Maximum.Wind, '<br>Date: ', Date), split=~Name,
size = ~Maximum.Wind, alpha = 0.5, color=~Date) %>%
layout(
plot_bgcolor = '#191A1A', paper_bgcolor = '#191A1A',
mapbox = list(style = 'dark',
zoom = 3,
center = list(lat = median(hur$Latitude),
lon = median(hur$Longitude))),
margin = list(l = 0, r = 0,
b = 0, t = 0,
pad = 0),
showlegend=FALSE)
p
Looks pretty good! As maximum wind increases over time, so does to size of the dots. You can hover over the data points and get the maximum wind and date at the point in time.
We can create even more complex and colorful graphs that can really grab anyone’s attention. We plotted one track, but why don’t we try to graph the tracks of all major hurricanes that formed in September?
p2 <- plot_mapbox(mode = 'scattermapbox') %>%
add_markers(
data = hur[format.Date(hur$Date, "%m")=="09" & hur$Maximum.Wind>=96,], x = ~Longitude, y = ~Latitude, text=~paste('Name: ', Name, '<br>Max Wind:', Maximum.Wind, '<br>Date: ', Date), color=~ID,
size = ~Maximum.Wind, alpha = 0.5) %>%
layout(
plot_bgcolor = '#191A1A', paper_bgcolor = '#191A1A',
mapbox = list(style = 'dark',
zoom = 3,
center = list(lat = median(hur$Latitude),
lon = median(hur$Longitude))),
margin = list(l = 0, r = 0,
b = 0, t = 0,
pad = 0),
showlegend=FALSE)
p2
Again, we used mapbox and plotly in conjunction to specify hurricane winds that were only above 96 knots (equal to over 110 MPH) and only in the month of September. We also colored each track by their specific ID so we can distinguish tracks better.
We can also create histograms:
(p3 <- plot_ly(alpha = 0.6) %>%
add_histogram(x = hur$Maximum.Wind[format.Date(hur$Date, "%m")=="09" & hur$Maximum.Wind>0], name = 'September', autobinx=TRUE) %>%
add_histogram(x = hur$Maximum.Wind[format.Date(hur$Date, "%m")=="06" & hur$Maximum.Wind>0], name = 'June') %>%
add_histogram(x = hur$Maximum.Wind[format.Date(hur$Date, "%m")=="07" & hur$Maximum.Wind>0], name = 'July') %>%
add_histogram(x = hur$Maximum.Wind[format.Date(hur$Date, "%m")=="08" & hur$Maximum.Wind>0], name = 'August') %>%
layout(barmode = "stack"))
Let’s look at frequency of hurricanes by month in a 2D Density heat map.
##Creating a heatmap of origin of all hurricane data
atlantic_map<-get_map(location = "puerto rico", zoom = 4)
hur.aug.freq<-hur[format.Date(hur$Date, "%m")=="08",]
hur.sep.freq<-hur[format.Date(hur$Date, "%m")=="09",]
hur.oct.freq<-hur[format.Date(hur$Date, "%m")=="10",]
hur.nov.freq<-hur[format.Date(hur$Date, "%m")=="11",]
par(mfrow = c(2,2))
#second, plotting density maps across the
ggmap(atlantic_map, extent = "device") + geom_density2d(data = hur.aug.freq, aes(x = Longitude, y = Latitude), size = 0.3) + stat_density2d(data = hur.aug.freq, aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..), size = 0.01,bins = 16, geom = "polygon") + scale_fill_gradient(low = "green", high = "red") + scale_alpha(range = c(0, 0.3), guide = FALSE)
ggmap(atlantic_map, extent = "device") + geom_density2d(data = hur.sep.freq, aes(x = Longitude, y = Latitude), size = 0.3) + stat_density2d(data = hur.sep.freq, aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..), size = 0.01,bins = 16, geom = "polygon") + scale_fill_gradient(low = "green", high = "red") + scale_alpha(range = c(0, 0.3), guide = FALSE)
ggmap(atlantic_map, extent = "device") + geom_density2d(data = hur.oct.freq, aes(x = Longitude, y = Latitude), size = 0.3) + stat_density2d(data = hur.oct.freq, aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..), size = 0.01,bins = 16, geom = "polygon") + scale_fill_gradient(low = "green", high = "red") + scale_alpha(range = c(0, 0.3), guide = FALSE)
ggmap(atlantic_map, extent = "device") + geom_density2d(data = hur.nov.freq, aes(x = Longitude, y = Latitude), size = 0.3) + stat_density2d(data = hur.nov.freq, aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..), size = 0.01,bins = 16, geom = "polygon") + scale_fill_gradient(low = "green", high = "red") + scale_alpha(range = c(0, 0.5), guide = FALSE)
We see some interesting relationships between the peak months of August - November. The second week of Semptember is considered the climatogical peak of the season, and as we can seethe heatmap from that month is practically all over the place and covering most of the Atlantic ocean and southern coasts.
It’s even more interesting to see how close August and September tracks seem to be, with the difference seemingly being the frequency. Comparing this to October and November where there is more of a concentration with Carribean tracks. I’m sure there is more of an atmospheric and meteorogical explanation for this.
That looked fantastic, but can we see where the maximum winds were located for all systems?
hur_new<-hur[hur$Maximum.Wind>0 & hur$Minimum.Pressure>0,] #Getting rid of negative values
max_winds<-data.frame(hur_new) %>%
group_by(ID, Minimum.Pressure, Latitude, Longitude) %>%
dplyr::summarise(Maximum.Wind = max(Maximum.Wind))
max_winds.major<-max_winds[max_winds$Maximum.Wind>96,]
par(mfrow = c(1,1))
For all systems and where they got their peak strength:
ggmap(atlantic_map, extent = "device") + geom_density2d(data = max_winds, aes(x = Longitude, y = Latitude), size = 0.3) + stat_density2d(data = max_winds, aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..), size = 0.01,bins = 16, geom = "polygon") + scale_fill_gradient(low = "green", high = "red") + scale_alpha(range = c(0, 0.3), guide = FALSE)
For all major hurricanes [>96 kt (110 mph)] and where they accumlated their peak strength:
ggmap(atlantic_map, extent = "device") + geom_density2d(data = max_winds.major, aes(x = Longitude, y = Latitude), size = 0.3) + stat_density2d(data = max_winds.major, aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..), size = 0.01,bins = 16, geom = "polygon") + scale_fill_gradient(low = "green", high = "red") + scale_alpha(range = c(0, 0.3), guide = FALSE)
It’s interesting to see that for all systems (hurricanes, tropical storms, tropical depression), the peak strength was along the east coast. But, when we only account for major hurricanes, we see a long strip through the upper carribean and Bahamas where storms accumulated their peak strength. It’s interesting to also see that both Category 5 hurricanes from 2017 (Irma and Maria) from this year also accumulated their maximum wind along this strip.
Here, we look at Maximum Wind vs. Minimum Pressure:
hur_new<-hur[hur$Maximum.Wind>0 & hur$Minimum.Pressure>0,] #Getting rid of negative values
max_winds2<-data.frame(hur_new) %>%
group_by(ID, Minimum.Pressure) %>%
dplyr::summarise(Maximum.Wind = max(Maximum.Wind))
m <- loess(Maximum.Wind ~ Minimum.Pressure, data = max_winds2)
(min_p <- plot_ly(max_winds2, x = ~Minimum.Pressure, color = 'rgb(255, 0, 0)') %>%
add_markers(y = ~Maximum.Wind, text = rownames(max_winds2), showlegend = FALSE, opacity = 0.5) %>%
add_lines(y = ~fitted(loess(Maximum.Wind ~ Minimum.Pressure)),
line = list(color = 'rgba(7, 164, 181, 1)'),
name = "Loess Smoother") %>%
add_ribbons(data = augment(m),
ymin = ~.fitted - 1.96 * .se.fit,
ymax = ~.fitted + 1.96 * .se.fit,
line = list(color = 'rgba(7, 164, 181, 0.05)'),
fillcolor = 'rgba(7, 164, 181, 0.2)',
name = "Standard Error") %>%
layout(xaxis = list(title = 'Minimum Pressure (mb)'),
yaxis = list(title = 'Maximum Winds (kt)'),
legend = list(x = 0.80, y = 0.90)))
summary(lm(Maximum.Wind ~ Minimum.Pressure, data = max_winds2))
##
## Call:
## lm(formula = Maximum.Wind ~ Minimum.Pressure, data = max_winds2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.051 -5.927 0.282 6.443 54.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.301e+03 4.937e+00 263.5 <2e-16 ***
## Minimum.Pressure -1.256e+00 4.988e-03 -251.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.35 on 10550 degrees of freedom
## Multiple R-squared: 0.8573, Adjusted R-squared: 0.8573
## F-statistic: 6.34e+04 on 1 and 10550 DF, p-value: < 2.2e-16
Thus, we can see that the maximum winds of a tropical system are highly correlated with a lower minimum pressure. In fact, for every 1.25 mb decrease in minimum pressure, we increase by 1 kt (1.15 mph) in maximum wind for a given time.
We created an R Shiny Application that would help us visualize different systems and answer some questions we had about the data. The current version of the application has three visualizations that give us insight about different types of tropical systems based on origin, closest tracks, and individual Latitude/Longitude points grouped by windspeed.
\(\textbf{Closest Tracks}\): This does something similar to the origin visualization, but instead of just creating a boundary box on the origin, lets create a boundary box on the \(\textbf{entire}\) specified track, find all the points that are contained in that box, and count in descending order which tracks had the most points in the boundary box. This was the most simple way to find the closest tracks for a given system. We could have used a more mathematical approach by reducing sums of squares by distance (or other methods), but this was the most simplest algorithm to do what we wanted. This tab also gives you a table of those specific tracks that were most similar to the one you chose.
\(\textbf{Path Finder}\): This is a more useful function and interface one could use for future systems. You could pick a latitude, longitude, and a current wind speed and it would create another boundary box within the +/1 latitude and longitude within that specific point, and any points that had winds that were +/- 5 kt. Then it lists the first 5 tracks in descending order by how close a system is close to the wind specified.